Resolving Translation Ambiguity using Monolingual Corpora. A Report on Clairvoyance CLEF-2002 Experiments

Authors

  • Yan Qu
  • Gregory Grefenstette
  • David A. Evans
Abstract

Choosing the correct target words is a difficult problem for machine translation. In cross-language information retrieval, this problem of choice is mitigated since more than one translation alternative can be retained in the target query. Between choosing just one word as a translation and keeping all the possible translations for each source word, one can apply a range of filtering techniques for eliminating some words and keeping others. In the bilingual track of CLEF 2002, focusing on word translation ambiguity, we experimented with several techniques for choosing the best target translation for each source query word by using co-occurrence statistics in a reference corpus consisting of documents in the target language. One of two distinct corpora was used, the target-language test corpus or the World Wide Web. Our techniques give one best translation per source query word. We also experimented with combining these word choice results (providing up to three translations for each word) in the final translated query. The source query languages were Spanish and Chinese; the target language documents were in English. We submitted four automatic runs for each language pair. When the methods were combined, mixing results obtained with different reference corpora, the recall and average precision of Spanish-to-English retrieval reached 95% and 97%, respectively, of the recall and average precision of an English monolingual retrieval run. For Chinese-to-English text retrieval, the recall and average precision reached 89% and 60%, respectively, of the English run.
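The abstract does not spell out the scoring formula, so the following is only a minimal sketch of the general idea: pick one target translation per source query word using co-occurrence statistics from a target-language reference corpus. Everything in it is an assumption made for illustration: the document-level co-occurrence count, the additive score, the function names (`cooccurrence_counts`, `choose_translations`, `combine_choices`), and the toy Spanish-to-English data. None of it is taken from the Clairvoyance system.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count, for each unordered word pair, how many documents contain both words.

    Assumption: document-level co-occurrence on whitespace tokens.
    """
    counts = Counter()
    for doc in documents:
        words = sorted(set(doc.lower().split()))
        for a, b in combinations(words, 2):
            counts[(a, b)] += 1
    return counts


def pair_count(counts, a, b):
    """Look up the co-occurrence count of an unordered word pair."""
    return counts[tuple(sorted((a, b)))]


def choose_translations(candidates, counts):
    """Pick one translation per source word.

    Assumed score: the sum of co-occurrence counts between a candidate and
    every candidate translation of the *other* source query words.
    """
    chosen = {}
    for src, options in candidates.items():
        def score(option):
            return sum(
                pair_count(counts, option, other)
                for other_src, other_options in candidates.items()
                if other_src != src
                for other in other_options
            )
        chosen[src] = max(options, key=score)
    return chosen


def combine_choices(*runs):
    """Merge the single best choices from several runs (e.g. different
    reference corpora), keeping each distinct translation once per word."""
    combined = {}
    for run in runs:
        for src, word in run.items():
            kept = combined.setdefault(src, [])
            if word not in kept:
                kept.append(word)
    return combined


# Toy reference corpus and bilingual dictionary entries (hypothetical data).
reference_corpus = [
    "the bank raised the interest rate on money deposits",
    "money in the bank earns interest",
    "he sat on a bench in the park",
    "public concern about the park bench",
]
candidates = {
    "banco": ["bank", "bench"],
    "dinero": ["money"],
    "interés": ["interest", "concern"],
}

best = choose_translations(candidates, cooccurrence_counts(reference_corpus))
print(best)
```

On this toy data the sketch selects "bank", "money", and "interest". Running `choose_translations` against counts from a second reference corpus and passing both results to `combine_choices` would mirror, in spirit, the combination step described above, where a source word can end up with more than one translation in the final query.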

Similar resources

Resolving Translation Ambiguity and Target Polysemy in Cross-Language Information Retrieval

This paper deals with translation ambiguity and target polysemy problems together. Two monolingual balanced corpora are employed to learn word co-occurrence for translation ambiguity resolution, and augmented translation restrictions for target polysemy resolution. Experiments show that the model achieves 62.92% of monolingual information retrieval, and is 40.80% addition to the select-all mode...

EXETER at CLEF 2002: Experiments with Machine Translation for Monolingual and Bilingual Retrieval

This year, the University of Exeter participated in both the CLEF 2002 monolingual and bilingual task for two languages: Italian and Spanish. We submitted 4 ranked results each for both Italian and Spanish Monolingual tasks and 5 each for the bilingual tasks. We report experimental results from our investigations of merging topic translations from two machine translation (MT) systems and recent...

Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation

Resolving coordination ambiguity is a classic hard problem. This paper looks at coordination disambiguation in complex noun phrases (NPs). Parsers trained on the Penn Treebank are reporting impressive numbers these days, but they don’t do very well on this problem (79%). We explore systems trained using three types of corpora: (1) annotated (e.g. the Penn Treebank), (2) bitexts (e.g. Europarl),...

Clairvoyance CLEF-2003 Experiments

In CLEF 2003, Clairvoyance participated in the bilingual retrieval track with the German and Italian language pair. As we did not have any German-to-Italian translation resources, we used the Babel Fish translation service provided by Altavista.com for translating German topics into Italian, with English as a pivot language. Then the translated Italian topics were used for retrieving Italian do...

A Language-Independent Approach to European Text Retrieval

We present an approach to multilingual information retrieval that does not depend on the existence of specific linguistic resources such as stemmers or thesauri. Using the HAIRCUT system we participated in the monolingual, bilingual, and multilingual tasks of the CLEF-2000 evaluation. Our method, based on combining the benefits of words and character n-grams, was effective for both language-in...

Publication date: 2002